ABSTRACT

ALFRED (http://alfred.med.yale.edu) is a free,
web-accessible, curated compilation of allele frequency
data on DNA sequence polymorphisms in
anthropologically defined human populations.
Currently, ALFRED has allele frequency tables on
over 663400 polymorphic sites; 170 of them have
frequency tables for more than 100 different population
samples. In ALFRED, a population may have
multiple samples, with each 'sample' consisting of
many individuals on which an allele frequency is
based. There are 3566 population samples from
710 different populations with allele frequency
tables on at least one polymorphism. Fifty of those
population samples have allele frequency data for
over 650000 polymorphisms. Records also have
active links to relevant resources (dbSNP,
PharmGKB, OMIM, Ethnologue, etc.). The flexible
search options and the data display and download
capabilities available through the web interface
allow easy access to the large quantity of
high-quality data in ALFRED.



INTRODUCTION 

In this article, we provide a detailed overview of
ALFRED, which has evolved considerably since the
previous published descriptions >8 years ago (1-3).
ALFRED is designed to be a resource for research and
for education in diverse areas related to human genetic
diversity. ALFRED's focus is on allele frequencies in
diverse anthropologically defined populations. It is not a
compendium of human DNA polymorphisms, but of
allele frequencies of polymorphisms, with an emphasis on
those polymorphisms that have been studied in multiple
populations. It is distinct from such databases as dbSNP
(4,5), which is an uncurated catalog of sequence
polymorphisms. We are not aware of any existing database
(private or public) other than ALFRED that attempts to
meet the research needs of the broader human population
genetics and molecular anthropology communities. There
are many small and/or highly specialized databases.
Applications such as FINDBase (6) (inherited disorders),
STRbase (7) (forensic STRs), PharmGKB (8)
(pharmacogenetic loci) and dbMHC (9) (HLA
polymorphisms) are all excellent but specialized databases.
All the data in ALFRED are considered to be in the
public domain and available for use in research and
teaching.

Sources of data in ALFRED include the following: (i)
data extracted from the published literature. Allele
frequency data and related information are extracted from
published papers located by ALFRED researchers and
curators, who routinely scan the literature; (ii) data
generated in the laboratories of K.K. and J.R. Kidd in
the Department of Genetics at Yale, including extensive
unpublished data; (iii) data submitted by collaborators or
other researchers in electronic format; (iv) data in publicly
available high-throughput SNP data sets such as the
CEPH-HGDP data. Other high-throughput data are
also being entered when possible to provide a single,
more integrated resource. Intensive curation and data
integrity checks are performed before any data are uploaded
into ALFRED.

Starting from our pre-existing database in 2000, we
have progressively added more data (Table 1), improved
the functionality of the web interface and elaborated the
database structure. As of August 2011, there are
35 229 132 allele frequency tables (one population sample
typed for one site) in ALFRED, with additions ongoing on
a regular basis.

ALFRED continues to be supported by grants from the 
U.S. National Science Foundation to be an international 
resource for research and teaching. 

DATABASE STRUCTURE AND CONTENT 

ALFRED has been implemented using relational database 
technology (Figure 1). All data are stored in an Oracle 
relational database management system. An individual
polymorphism (or 'Site') is contained within a locus 
('Loci' table) on the genome. Ethnic populations 
('Populations' table) are organized by their geographic 
location ('Geographic_Region'). Multiple samples 



Table 1. The growth in contents summarized

Records            2000    2003    2006       2009     Present

Frequency tables   2865  11 505  39 298  1 149 836  35 230 146
Polymorphisms       180     698    1436   ~662 880     663 449
Populations          70     352     453        690         710

('Samples' table) may be drawn from a particular population.
For such highly heterogeneous populations as
African-American or European-American, special care
is taken to delineate the specific geographic region of the
population sample and to clearly distinguish among the
multiple samples. The alleles at a site are in the 'Alleles'
table. Because an allele frequency estimate is specific for a
sample, the table 'Typed_Sample' bridges the tables
Samples and Sites. The allele frequency values for a
Typed_Sample are stored in the 'Frequencies' table with
the associated typing method, which is detailed in the
'Typing_Method' table. All publication-related information
is stored in a single 'Publications' table, and intermediate
tables are defined to link Publications to Frequencies,
Figure 1. Core Structure of ALFRED.

Table 2. URLs to ALFRED pages mentioned in the text

Page               URL

ALFRED home page   http://alfred.med.yale.edu
Table numbers      http://alfred.med.yale.edu/alfred/alfredsummary.asp
FAQ                http://alfred.med.yale.edu/alfred/alfredFaq.asp
ALFRED flyer       http://alfred.med.yale.edu/alfred/flyer/ALFREDFlyer.pdf

Samples, Sites and Loci. Links to other web sites are
stored in the 'URLs' table. These links are associated
with the Loci, Sites, Populations and Publications tables.
All frequency records are linked to the contributor
('Contributors' table), which stores information about
individuals who contribute the allele frequency data.
Detailed descriptions of the individual tables (including
their fields) are available from 'Data structure' (Table 2).
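
The relationships described above can be sketched as a small relational schema. The following is only an illustration built with Python's standard sqlite3 module, not ALFRED's actual Oracle schema: the table names come from the text, but every column name, key and demo value is a hypothetical placeholder, and the Typing_Method, Publications, URLs and Contributors tables are omitted for brevity.

```python
import sqlite3

# Table names follow the text; all column names are hypothetical.
SCHEMA = """
CREATE TABLE Geographic_Region (region_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE Populations (pop_uid TEXT PRIMARY KEY, name TEXT,
    region_id INTEGER REFERENCES Geographic_Region(region_id));
CREATE TABLE Samples (sample_uid TEXT PRIMARY KEY, sample_size INTEGER,
    pop_uid TEXT REFERENCES Populations(pop_uid));
CREATE TABLE Loci (locus_uid TEXT PRIMARY KEY, symbol TEXT);
CREATE TABLE Sites (site_uid TEXT PRIMARY KEY, name TEXT,
    locus_uid TEXT REFERENCES Loci(locus_uid));
CREATE TABLE Alleles (allele_id INTEGER PRIMARY KEY, symbol TEXT,
    site_uid TEXT REFERENCES Sites(site_uid));
-- Typed_Sample bridges Samples and Sites: one sample typed for one site.
CREATE TABLE Typed_Sample (typed_id INTEGER PRIMARY KEY,
    sample_uid TEXT REFERENCES Samples(sample_uid),
    site_uid TEXT REFERENCES Sites(site_uid));
CREATE TABLE Frequencies (
    typed_id INTEGER REFERENCES Typed_Sample(typed_id),
    allele_id INTEGER REFERENCES Alleles(allele_id),
    frequency REAL);
"""


def build_demo_db():
    """Create an in-memory database holding one made-up frequency table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executescript("""
        INSERT INTO Geographic_Region VALUES (1, 'Demo region');
        INSERT INTO Populations VALUES ('POP1', 'Demo population', 1);
        INSERT INTO Samples VALUES ('SAMPLE1', 50, 'POP1');
        INSERT INTO Loci VALUES ('LOCUS1', 'DEMO');
        INSERT INTO Sites VALUES ('SITE1', 'demo_site', 'LOCUS1');
        INSERT INTO Alleles VALUES (1, 'A', 'SITE1'), (2, 'G', 'SITE1');
        INSERT INTO Typed_Sample VALUES (1, 'SAMPLE1', 'SITE1');
        INSERT INTO Frequencies VALUES (1, 1, 0.25), (1, 2, 0.75);
    """)
    return conn


def frequencies_for_site(conn, site_uid):
    """Return (population, sample, allele, frequency) rows for one site."""
    return conn.execute(
        """
        SELECT p.name, s.sample_uid, a.symbol, f.frequency
        FROM Frequencies f
        JOIN Typed_Sample t ON t.typed_id = f.typed_id
        JOIN Samples s ON s.sample_uid = t.sample_uid
        JOIN Populations p ON p.pop_uid = s.pop_uid
        JOIN Alleles a ON a.allele_id = f.allele_id
        WHERE t.site_uid = ?
        ORDER BY s.sample_uid, a.symbol
        """,
        (site_uid,),
    ).fetchall()
```

Calling frequencies_for_site(build_demo_db(), "SITE1") walks the same path as the text: from Frequencies through Typed_Sample to the sample, its population and the alleles of the site.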



ACCESS TO THE DATA IN ALFRED 

Specific information in ALFRED can be accessed in
multiple ways through the web. Users familiar with the
Google search engine may search for an rs number, gene
symbol, population or ALFRED UID by simply
concatenating the string ALFRED to the search term.
For example, a search term like 'ALFRED
E_rs1587264_10' or 'ALFRED SI001677V' will list the
URL link to ALFRED's 'Polymorphism Information'
page for the rs number rs1587264. E_rs1587264_10 is
the TaqMan assay in the Applied Biosystems catalog
used to obtain the allele frequency; SI001677V is the
UID of the record in ALFRED. A simple rs number or
gene name would work as well. Similarly, concatenating a
population name or a specific population UID may bring
up the link to the corresponding 'Population Information'
page. This use of Google requires prior knowledge of one
of the terms in ALFRED and does not always retrieve the
desired result, but it is very quick when it does work.

Usually, specific information in ALFRED will be
accessed through the ALFRED web interface, which
offers multiple options. The ALFRED web site also
allows direct access to a specific record using the
keyword search function available on the ALFRED
home page. Users have the option of selecting the type
of search, 'Any part of' or 'Begins with', and the table
that should be searched. The option 'Any part of'
considers ALFRED names that contain the entered string
of characters anywhere, while 'Begins with' only considers
ALFRED names that begin with the entered string of
characters. In addition, the search can be restricted to
a single database table. The resulting output
is a comprehensive table of the different occurrences of
the search term, the database table in which each occurs
and a link to navigate to the corresponding description
page. Users looking for a specific SNP by its dbSNP
refSNP identifier (rs number), a gene symbol or a
population can take advantage of this search option.
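
The difference between the two search types can be illustrated with a few lines of Python. This is only a sketch of the matching rule as described in the text, not the code behind the web site; the function name, the mode labels and the case-insensitive comparison are assumptions.

```python
def keyword_search(names, term, mode="any"):
    """Filter names by the two search types described in the text.

    mode="any"    -> 'Any part of': the term may occur anywhere in the name.
    mode="begins" -> 'Begins with': the name must start with the term.
    Matching is case-insensitive here, which is an assumption.
    """
    needle = term.lower()
    if mode == "any":
        return [n for n in names if needle in n.lower()]
    if mode == "begins":
        return [n for n in names if n.lower().startswith(needle)]
    raise ValueError("mode must be 'any' or 'begins'")


# Locus symbols that appear elsewhere in this article.
loci = ["ADH1B", "ADH4", "TAS2R16"]
begins_with_adh = keyword_search(loci, "adh", mode="begins")
contains_2r = keyword_search(loci, "2R", mode="any")
```

With these inputs, the 'Begins with' search keeps the two ADH loci, while the 'Any part of' search for "2R" matches only TAS2R16.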

A more generalized method for searching ALFRED
without specific prior criteria is to follow the two
options under the tabbed menu item Search: Loci and
Population. The returned results are organized as
follows. Loci are organized both in genomic order by
chromosome and molecular position and in alphabetic
order. Following either of the options, selecting a
locus will bring the user to the specific Locus
Information page. Each locus record is annotated with
alternate names (synonyms), chromosomal position, a
valid HUGO Nomenclature Committee locus symbol
and links to external databases such as Entrez Gene,
UniGene, OMIM, PharmGKB and Genopedia (HuGE
Navigator). Genetic polymorphisms and haplotypes
in the selected locus are displayed in a table, ordered
by chromosomal position. For example, see
(http://alfred.med.yale.edu/alfred/recordinfo.asp?UNID=LO000422I).
A polymorphism or haplotype can be selected to navigate
to the Polymorphism Information page. Each polymorphism
record is annotated with dbSNP rs number, alternate
names (synonyms), ancestral allele and links to external
databases such as dbSNP and PharmGKB for expanded
molecular information. For example, see
(http://alfred.med.yale.edu/alfred/recordinfo.asp?UNID=SI000002C).
Populations are organized by geographic regions, and
selecting a population will bring the user to the corresponding
Population Information page. Each population record
is annotated with alternate names (synonyms), linguistic and
geographical location information and links to external
databases such as Ethnologue Language and Map
Projects for additional information. Active links to other
databases provided from ALFRED's population, locus
and site information pages facilitate easy retrieval of
additional information. For example, see
(http://alfred.med.yale.edu/alfred/recordinfo.asp?UNID=PO000036J).

Population samples are organized by populations and
annotated with sample information such as sample size
and relation to other samples. The wiki implementation
for ALFRED, 'ALFRED Wiki' (Table 2), allows users to
interact with ALFRED curators and get involved in
annotating ALFRED populations. ALFRED curators are
responsible for comparing the different wiki update
versions and adding relevant information to the population
descriptions in ALFRED. We invite ALFRED users
to participate in this effort of population annotation.
Users are required to create an account and log in to be
able to edit the ALFRED wiki pages. Contact us using
our feedback function on the ALFRED home page to
create an account.

Figure 2. Different allele frequency display formats for rs2066701 of the ADH1B gene.

Allele frequency records are accessed from the corresponding
site (Polymorphism Information) page. Display
formats available are graphical, tabular and pie-chart on
Google Map (Figure 2). The graphical stacked-bar
format offers a quick visual display of the frequency
variation among populations
(http://alfred.med.yale.edu/alfred/mvograph.asp?siteuid=SI001272M).
Each allele frequency record displayed is linked to the
population sample information, polymorphism information,
typing method and the publication the frequency was
extracted from. Most publication entries are linked to
PubMed for complete citation and possible links to the
full publication. For diallelic polymorphisms, the web
page also provides a table with the calculated Fst,
average heterozygosity (measures of genetic variation)
and number of populations with data available in
ALFRED for the selected site. The graphical stacked-bar
format and the pie-chart on Google Map offer quick visual
displays of the frequency variation among populations
(http://alfred.med.yale.edu/alfred/mvograph.asp?siteuid=SI014485W).
On the other hand, the tabular format
gives the frequency values and related information,
which can be used in analyses (see also Downloads)
(http://alfred.med.yale.edu/alfred/SiteTable1A_working.asp?siteuid=SI001272M).
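
The text does not state which estimators ALFRED uses for Fst and average heterozygosity. As a rough illustration only, the sketch below applies a common textbook formulation for a diallelic site, Fst = (H_T - H_S) / H_T, with every population sample weighted equally; the function names and the equal weighting are assumptions, and ALFRED's own calculation may differ.

```python
def average_heterozygosity(freqs):
    """Mean expected heterozygosity 2p(1-p) across samples (H_S).

    freqs holds the frequency of one of the two alleles in each sample.
    """
    if not freqs:
        raise ValueError("at least one allele frequency is required")
    return sum(2.0 * p * (1.0 - p) for p in freqs) / len(freqs)


def fst(freqs):
    """Textbook Fst = (H_T - H_S) / H_T for a diallelic site.

    H_T is the expected heterozygosity at the unweighted mean frequency.
    Returns 0.0 when the site is monomorphic overall (H_T == 0).
    """
    h_s = average_heterozygosity(freqs)
    p_bar = sum(freqs) / len(freqs)
    h_t = 2.0 * p_bar * (1.0 - p_bar)
    if h_t == 0.0:
        return 0.0
    return (h_t - h_s) / h_t
```

Under this formulation, two samples fixed for opposite alleles give the maximum value of 1, while samples with identical frequencies give 0.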

Every record in ALFRED has a unique identifier (UID)
that can be the basis of a search; the UID search option is
under the Search tab. While this search option is not used
very often, it can be very effective. The UIDs are text
strings consisting of three parts: for example, LO000423J is
the UID for the locus ADH4 (the prefix 'LO' indicates the
UID refers to a locus, the suffix J is the Check Character
and 000423 is a number generated by the system when the
record is created). Searching ALFRED with a UID (Site,
Population, Locus or Sample) will bring the user to the
corresponding 'Information' page. We have found it very
useful in human interactions to have the human-interpretable
prefixes as part of the UID schema. Similarly, the
check character helps prevent a false retrieval that could
result from a numeric typo. The SNP sets page under the
'Search' tab facilitates user access to defined SNP sets


D1014 Nucleic Acids Research, 2012, Vol. 40, Database issue 



published for ancestry inference and forensic individual
identification. The markers in each of these SNP sets are
annotated with relevant information including the locus
name, rs number, Fst, average heterozygosity and the
number of populations for which there are data available
in ALFRED. The page listing all of the SNPs in a set has
options for sorting by each of those values. Each record in
a set links out to the locus description page, the site
description page and the 'Google Map' with pie-chart
distribution of the allele frequencies (see also Downloads).
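
The three-part UID layout described above (two-letter prefix, system-generated number, check character) can be split apart with a short parser. This is a minimal sketch: the six-digit width is inferred from the example UIDs in this article, the prefix table lists only prefixes that appear in those examples, and the check character is extracted but not verified because the text does not describe how it is computed.

```python
import re

# Two-letter prefix, six digits, one check letter (width inferred from
# examples such as LO000423J, SI001272M and PO000036J).
UID_PATTERN = re.compile(r"^([A-Z]{2})(\d{6})([A-Z])$")

# Only prefixes seen in the article's examples; others map to "unknown".
PREFIX_TO_TYPE = {"LO": "locus", "SI": "site", "PO": "population"}


def parse_uid(uid):
    """Split an ALFRED-style UID into prefix, number and check character."""
    match = UID_PATTERN.match(uid.strip().upper())
    if match is None:
        raise ValueError(f"not a recognizable UID: {uid!r}")
    prefix, number, check = match.groups()
    return {
        "prefix": prefix,
        "record_type": PREFIX_TO_TYPE.get(prefix, "unknown"),
        "number": int(number),
        "check": check,  # extracted only; the algorithm is not public here
    }
```

For the example in the text, parse_uid("LO000423J") reports a locus record with number 423 and check character J.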



DATA DOWNLOAD FORMATS 

Several options are available for retrieving data from
ALFRED for various analyses by the user. Every individual
Polymorphism Information page allows allele frequency
download in several formats, including both
tab-delimited text and the input file format for the
population genetics software package 'Arlequin'. These
download options yield data comparable to the tabular
display format; each record gives the population name,
the sampleUID, and the frequencies of the two alleles.
The download will include a record for every sample for
which there are data. The field 'entryDate' in the file can
be used to distinguish between allele frequencies on the
same sample. A complete allele frequency data dump
can be obtained by downloading the 'alfredFreq.zip' or
'alfredFreqByChrom.zip' zipped files. The tables are in
text format (tab-delimited), which can easily be parsed
and opened in any text editor or MS Excel spreadsheet.
Similarly, all the sites and related information from the
Sites, Loci and Alleles tables can be obtained by downloading
'alfredPolymorphisms.zip', while the Populations table
is in 'alfredPops.zip'. All these files can be downloaded
from 'Downloads' (Table 2). Allele frequency tables for
selected SNP sets can be downloaded from the
'Downloads' page as well. As new interesting SNP sets
are added to ALFRED, the data will be made available
for download. The zip files are updated every Friday.
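
As an illustration of working with the tab-delimited download, the sketch below reads such a file with Python's csv module and keeps one record per sample. Only the field names 'sampleUID' and 'entryDate' come from the text; the other column names, the ISO date format and the rule of keeping the most recent entry are assumptions, and the demo rows are made up.

```python
import csv
import io


def latest_frequency_per_sample(tsv_text):
    """Keep the most recently entered record for each sampleUID.

    Assumes a header row and ISO-formatted entryDate values
    (YYYY-MM-DD), so that string comparison orders dates correctly.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    latest = {}
    for row in reader:
        key = row["sampleUID"]
        if key not in latest or row["entryDate"] > latest[key]["entryDate"]:
            latest[key] = row
    return latest


# Made-up demo rows; column names other than sampleUID and entryDate
# are hypothetical.
DEMO_TSV = (
    "population\tsampleUID\tallele1Freq\tallele2Freq\tentryDate\n"
    "Demo pop\tSAMPLE_A\t0.40\t0.60\t2009-01-15\n"
    "Demo pop\tSAMPLE_A\t0.42\t0.58\t2011-06-30\n"
    "Other pop\tSAMPLE_B\t0.10\t0.90\t2010-03-01\n"
)

records = latest_frequency_per_sample(DEMO_TSV)
```

Whether the newest entry is the right one to keep depends on the analysis; the point is only that 'entryDate' gives a way to tell repeated entries for the same sample apart.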



LINKING TO ALFRED 

In addition to the files listed above, two mapping tables
can be downloaded: one maps ALFRED UID for loci to
Entrez Gene Id (ALFREDGeneInfo.csv), and the other
maps ALFRED UID for sites to dbSNP rs number
(ALFREDVariantInfo.csv). Very often, related resources
on the web are interlinked by providing URLs to and
from relevant pages. These mapping tables will facilitate
easy creation of URLs to ALFRED. Based on UIDs,
anyone can create URLs to locus and site description
pages in ALFRED using the following format:
http://alfred.med.yale.edu/alfred/recordinfo.asp?UNID=<UID>
(where <UID> will be replaced by the actual
UID value). The above-mentioned two mapping tables
have facilitated reciprocal URLs from PharmGKB and
CDC's HuGE Navigator (10). In addition, reciprocal
URLs from the dbSNP rs number page to ALFRED's
Polymorphism Information page are maintained by
periodically submitting a dbSNP-specified XML file.
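
The URL format given above is simple enough to generate programmatically. The following sketch builds record URLs from UIDs; the URL pattern is the one stated in the text, while the function names and the shape of the mapping (pairs of external identifier and ALFRED UID, as one might read from the mapping tables) are assumptions.

```python
from urllib.parse import urlencode

ALFRED_RECORD_PAGE = "http://alfred.med.yale.edu/alfred/recordinfo.asp"


def alfred_record_url(uid):
    """Build the record URL for a locus or site UID."""
    return f"{ALFRED_RECORD_PAGE}?{urlencode({'UNID': uid})}"


def links_from_mapping(pairs):
    """Turn (external_id, alfred_uid) pairs into {external_id: url}.

    The pair layout is hypothetical; the real mapping files may use
    different columns.
    """
    return {external_id: alfred_record_url(uid) for external_id, uid in pairs}


# UID taken from the example earlier in the article (locus ADH4).
adh4_url = alfred_record_url("LO000423J")
```

A resource that already stores Entrez Gene Ids or rs numbers could use such a mapping to add outgoing links to the matching ALFRED pages.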

HIGHLIGHTS OF DATA IN ALFRED 

Over the years, there have been several interesting allele 
frequency additions to ALFRED. 

High-throughput data sets in ALFRED worth mention- 
ing are: 

• Over 350 autosomal short tandem repeat polymorph- 
isms typed on the CEPH-HGDP human diversity 
panel, which includes 51 worldwide populations. 
These polymorphisms are located throughout the 
genome (11); 

• Over 11 555 SNPs typed on 14 populations (12); 

• Over 650000 common SNPs typed by Illumina technology
(650Y panel) on the CEPH-HGDP panel of 51
populations (13). In addition, 876 markers from this
set typed on 46 Kidd Lab population samples are in
ALFRED; and

• Over 2800 SNPs typed on the CEPH-HGDP panel 
and an additional two Indian populations (total of 
55 samples) (14). 

Other smaller but interesting data additions to ALFRED 
(allele frequency tables for these sets are available from the 
'Downloads' page): 

• Thirty-four-plex assay marker data on the
CEPH-HGDP panel from Phillips et al. (15). In
addition, data for these markers typed on 46 Kidd
Lab populations are in ALFRED, bringing the total
to 98 population samples;

• Fifty-two 'SNPforlD' markers typed on 16 population 
samples from Sanchez et al. (16). Several of these 
markers have subsequently been typed on additional 
populations and data will be added to ALFRED; 

• 'LowFst' markers of forensic interest typed on the 
Kidd Lab population panel (17, 18); 

• One hundred and twenty-eight ancestry informative 
markers typed on 73 Kidd Lab populations (19); 

• Various interesting polymorphisms associated with 
human traits (20, 21); 

• Polymorphisms associated with 'lactase persistence' 
(22); and 

• TAS2R16 gene-coding polymorphisms typed on the 
HGDP-CEPH panel (23) and Kidd Lab populations 
(24). 



USER INVOLVEMENT 

We encourage users to communicate with us about the
interface or any data contained in ALFRED using the
'Feedback' page. Allele frequency data can be submitted
to us electronically by following the directions in the
guidelines for 'Data submission' (Table 2).

Comprehensive and up-to-date documentation of the 
contents and navigation tips can be obtained from 'Tour 
ALFRED', 'About ALFRED', 'ALFRED FAQ' and 
'ALFRED flyer' (Table 2). 
FUTURE DIRECTIONS

The number of records in ALFRED will continue to grow
as allele frequency data for new population samples and
SNPs are made available. During the coming months,
data from the Illumina 650Y panel will be entered for
several additional populations. Also, data download
options for a user-selected set of SNPs and populations
will be implemented. We also hope to enhance the
didactic value of the database. On these and other
directions for the future, we welcome comments and
suggestions toward better meeting the needs of the community.